Hierarchical Surface Prediction for 3D Object Reconstruction
Recently, Convolutional Neural Networks have shown promising results for 3D
geometry prediction. They can make predictions from very little input data such
as a single color image. A major limitation of such approaches is that they
only predict a coarse resolution voxel grid, which does not capture the surface
of the objects well. We propose a general framework, called hierarchical
surface prediction (HSP), which facilitates prediction of high resolution voxel
grids. The main insight is that it is sufficient to predict high resolution
voxels around the predicted surfaces. The exterior and interior of the objects
can be represented with coarse resolution voxels. Our approach is not dependent
on a specific input type. We show results for geometry prediction from color
images, depth images and shape completion from partial voxel grids. Our
analysis shows that our high resolution predictions are more accurate than low
resolution predictions. Comment: 3DV 2017
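The core idea lends itself to a small illustration. The sketch below (NumPy, not the authors' HSP implementation) allocates a finer sub-grid only for coarse voxels that lie on the predicted surface boundary; the learned decoder that would actually predict those finer occupancies is replaced by a placeholder upsampling.

```python
# A minimal sketch, assuming a coarse occupancy grid has already been predicted;
# not the authors' HSP code. Only boundary voxels are refined.
import numpy as np

def refine_near_surface(coarse_occ, factor=2, threshold=0.5):
    """coarse_occ: (D, H, W) array of occupancy probabilities.
    Returns {coarse cell index: finer sub-block}, allocated only for cells
    on the predicted surface boundary."""
    occ = coarse_occ > threshold
    boundary = np.zeros_like(occ)
    for axis in range(3):            # a cell is "boundary" if any 6-neighbor disagrees
        for shift in (-1, 1):        # (wrap-around at the borders is ignored for brevity)
            boundary |= (occ != np.roll(occ, shift, axis=axis))
    fine_blocks = {}
    for idx in np.argwhere(boundary):
        cell = tuple(idx)
        # In HSP a learned decoder would predict this sub-block; here we simply
        # repeat the coarse probability as a stand-in.
        fine_blocks[cell] = np.full((factor,) * 3, coarse_occ[cell])
    return fine_blocks
```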
Pose Induction for Novel Object Categories
We address the task of predicting pose for objects of unannotated object
categories from a small seed set of annotated object classes. We present a
generalized classifier that can reliably induce pose given a single instance of
a novel category. When a large collection of novel instances is available, our approach jointly reasons over all instances to improve the initial estimates. We empirically validate the various components of our
algorithm and quantitatively show that our method produces reliable pose
estimates. We also show qualitative results on a diverse set of classes and
further demonstrate the applicability of our system for learning shape models
of novel object classes.
Viewpoints and Keypoints
We characterize the problem of pose estimation for rigid objects in terms of
determining viewpoint to explain coarse pose and keypoint prediction to capture
the finer details. We address both these tasks in two different settings - the
constrained setting with known bounding boxes and the more challenging
detection setting where the aim is to simultaneously detect and correctly
estimate pose of objects. We present Convolutional Neural Network based
architectures for these and demonstrate that leveraging viewpoint estimates can
substantially improve local appearance based keypoint predictions. In addition
to achieving significant improvements over state-of-the-art in the above tasks,
we analyze the error modes and the effect of object characteristics on performance to guide future efforts towards this goal.
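As a rough illustration of how a viewpoint estimate can sharpen local keypoint predictions, the sketch below (a simplification, not the paper's architecture) fuses an appearance-based keypoint heatmap with a viewpoint-conditioned spatial prior by renormalized elementwise multiplication; the prior itself is assumed to be given.

```python
# A minimal sketch, not the paper's exact formulation: fuse appearance-based
# keypoint heatmaps with a viewpoint-conditioned spatial prior (assumed given).
import torch

def combine_keypoint_evidence(appearance_logits, viewpoint_prior, eps=1e-6):
    """appearance_logits: (K, H, W) per-keypoint heatmap logits from a CNN.
    viewpoint_prior: (K, H, W) nonnegative prior over keypoint locations,
    rendered from the estimated viewpoint. Returns fused probabilities."""
    k = appearance_logits.shape[0]
    appearance_prob = torch.softmax(appearance_logits.view(k, -1), dim=-1)
    appearance_prob = appearance_prob.view_as(appearance_logits)
    fused = appearance_prob * (viewpoint_prior + eps)       # elementwise fusion
    return fused / fused.sum(dim=(-2, -1), keepdim=True)    # renormalize per keypoint
```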
SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction
We propose SparseFusion, a sparse view 3D reconstruction approach that
unifies recent advances in neural rendering and probabilistic image generation.
Existing approaches typically build on neural rendering with re-projected
features but fail to generate unseen regions or handle uncertainty under large
viewpoint changes. Alternative methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D representation. However, we find that this trade-off between
3D consistency and probabilistic image generation does not need to exist. In
fact, we show that geometric consistency and generative inference can be
complementary in a mode-seeking behavior. By distilling a 3D consistent scene
representation from a view-conditioned latent diffusion model, we are able to
recover a plausible 3D representation whose renderings are both accurate and
realistic. We evaluate our approach across 51 categories in the CO3D dataset
and show that it outperforms existing methods, in both distortion and
perception metrics, for sparse-view novel view synthesis. Comment: project page: https://sparsefusion.github.io/; v2: typo corrected in table 3; v3: added ablation
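The distillation idea can be sketched as a score-distillation-style update: optimize a 3D scene representation so that its renderings are pulled toward the modes of a view-conditioned diffusion model. In the sketch below, `render`, `diffusion_eps`, `scene_params`, and the linear noise schedule are hypothetical placeholders rather than the paper's actual components.

```python
# A minimal sketch of distilling a diffusion model into a 3D representation;
# a simplified stand-in for the paper's procedure, not its implementation.
import torch

def distillation_step(scene_params, render, diffusion_eps, camera, optimizer,
                      num_timesteps=1000):
    """One gradient step pulling rendered views toward the modes of a
    view-conditioned diffusion model (score-distillation-style update).
    render(scene_params, camera) -> (3, H, W) image in [0, 1];
    diffusion_eps(noisy, t, camera) -> predicted noise."""
    image = render(scene_params, camera)
    t = torch.randint(1, num_timesteps, (1,)).item()   # random noise level
    alpha = 1.0 - t / num_timesteps                    # toy noise schedule
    noise = torch.randn_like(image)
    noisy = alpha**0.5 * image + (1 - alpha)**0.5 * noise
    eps_pred = diffusion_eps(noisy, t, camera)
    # Gradient of the distillation objective w.r.t. the rendering; the diffusion
    # network itself is not backpropagated through.
    grad = (eps_pred - noise).detach()
    loss = (grad * image).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```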
Articulation-aware Canonical Surface Mapping
We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that
indicates the mapping from 2D pixels to corresponding points on a canonical
template shape, and 2) inferring the articulation and pose of the template
corresponding to the input image. While previous approaches rely on keypoint
supervision for learning, we present an approach that can learn without such
annotations. Our key insight is that these tasks are geometrically related, and
we can obtain supervisory signal via enforcing consistency among the
predictions. We present results across a diverse set of animal object
categories, showing that our method can learn articulation and CSM prediction
from image collections using only foreground mask labels for training. We
empirically show that allowing articulation helps learn more accurate CSM
prediction, and that enforcing the consistency with predicted CSM is similarly
critical for learning meaningful articulation. Comment: To appear at CVPR 2020; project page: https://nileshkulkarni.github.io/acsm
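The geometric consistency signal can be written compactly: a pixel's predicted canonical template point, once articulated, posed, and reprojected, should land back on that pixel. The sketch below only illustrates that loss; `articulate`, `project`, and the tensor layout are assumptions, not the authors' code.

```python
# A minimal sketch of the reprojection consistency between CSM and articulation
# predictions; the helper functions and shapes below are assumed, not the
# authors' implementation.
import torch

def csm_consistency_loss(pixels, csm_points, articulation, camera,
                         articulate, project, mask):
    """pixels: (N, 2) image coordinates inside the foreground mask.
    csm_points: (N, 3) predicted canonical template points for those pixels.
    articulate(points, articulation) -> posed 3D points;
    project(points, camera) -> (N, 2) image coordinates; mask: (N,) weights."""
    posed = articulate(csm_points, articulation)
    reprojected = project(posed, camera)
    per_pixel = ((reprojected - pixels) ** 2).sum(dim=-1)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)
```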
Visual Affordance Prediction for Guiding Robot Exploration
Motivated by the intuitive understanding humans have about the space of
possible interactions, and the ease with which they can generalize this
understanding to previously unseen scenes, we develop an approach for learning
visual affordances for guiding robot exploration. Given an input image of a
scene, we infer a distribution over plausible future states that can be
achieved via interactions with it. We use a Transformer-based model to learn a
conditional distribution in the latent embedding space of a VQ-VAE and show
that these models can be trained using large-scale and diverse passive data,
and that the learned models exhibit compositional generalization to diverse
objects beyond the training distribution. We show how the trained affordance
model can be used for guiding exploration by acting as a goal-sampling
distribution, during visual goal-conditioned policy learning in robotic
manipulation. Comment: Old paper; presented at ICRA 2023
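A rough sketch of the generative component: an autoregressive Transformer prior over VQ-VAE latent codes, conditioned on the current scene, from which plausible post-interaction states (goals) can be sampled. The `vqvae` and `prior` modules below are hypothetical stand-ins, not the paper's model.

```python
# A minimal sketch, assuming hypothetical `vqvae` and `prior` modules with the
# interfaces documented below; not the paper's model.
import torch

@torch.no_grad()
def sample_goal_image(scene_image, vqvae, prior, num_tokens=64, temperature=1.0):
    """scene_image: (1, 3, H, W) current observation.
    vqvae.encode(image) -> (1, T_cond) conditioning code indices;
    vqvae.decode(codes) -> image; prior(cond, prefix) -> (1, V) logits for the
    next latent code given the prefix sampled so far."""
    cond = vqvae.encode(scene_image)
    tokens = torch.empty(1, 0, dtype=torch.long)
    for _ in range(num_tokens):                    # autoregressive sampling
        logits = prior(cond, tokens) / temperature
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return vqvae.decode(tokens)                    # candidate goal image
```

Sampled goal images like this can then serve as the goal-sampling distribution during goal-conditioned policy learning, as described in the abstract above.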